ABSTRACT

Based on initial Scholastic Aptitude Test (SAT) 
Verbal pretest data and hypotheses advanced in the research 
literature, 7 sentence completion and 16 analogy items with extreme 
levels of differential item functioning (DIF) were selected and then 
systematically revised and re-administered in an attempt to reduce or 
eliminate DIF. The apparent success of the effort makes similar 
attempts worth continuing. The particular terminology used in stems 
and keys, rather than the underlying skill being measured, seemed to 
be a recurring source of DIF in the SAT-Verbal items. Larger sample 
sizes, especially for minority focal groups, would help to stabilize 
the DIF categories used by Educational Testing Service (ETS) test 
developers. In addition, because the ETS delta metric is unbounded at 
the extremes, the use of both the Standardization (p-metric) and 
Mantel-Haenszel (delta-metric) methodologies is recommended for 
classifying the level of DIF for very easy and very difficult items. 
Further research is suggested to study the possible relationship 
between DIF and predictive validity. Nine tables and four figures 
present analysis results. An appendix summarizes DIF hypotheses. 
(Contains 26 references.) (Author/SLD) 
Abstract



Based on initial SAT-Verbal pretest data and/or hypotheses
advanced in the research literature, the authors selected
7 sentence completions and 16 analogies with extreme levels
of differential item functioning (DIF) and then
systematically revised and readministered the items in an
attempt to reduce or eliminate DIF. Several diverse
conclusions can be drawn from the data. First, because of
the apparent success in reducing extreme levels of DIF in
SAT-Verbal items, the authors recommend that such efforts be
continued. Second, the particular terminology used in stems
and keys (rather than the underlying reasoning skill being
measured) seems to be a recurring source of DIF in
SAT-Verbal items. Third, larger sample sizes, particularly
for minority focal groups, would help to stabilize the DIF
categories used by Educational Testing Service (ETS) test
developers. Fourth, because the ETS delta metric is
unbounded at the extremes, the use of both the
Standardization (p-metric) and Mantel-Haenszel
(delta-metric) methodologies is recommended for classifying
the level of DIF for very easy and very difficult items.
Finally, the paper concludes with a suggestion for further
research concerning the possible relationship between DIF
and predictive validity.



Introduction 



Differential item functioning (DIF) statistics can be used 
to identify test questions on which the various focal (mi- 
nority or female) and reference (white or male) popula- 
tions perform differently. Since the mid-1980s, a series 
of DIF studies on the operational verbal sections of 
the SAT has been conducted to identify and assess the na- 
ture of the items on which DIF can be observed (Schmitt 
1985; Bleistein and Wright 1986; Wendler and Carlton 
1987; Rogers and Kulick 1987; Schmitt and Bleistein 
1987; Schmitt 1988; Lawrence, Curley, and McHale 
1988; Lawrence and Curley 1989; Schmitt and Dorans 
1990). 

In addition, randomized studies of specially con- 
structed items have been undertaken in an attempt to iso- 
late and evaluate factors — both within and across item 
types and testing programs — that may consistently result 
in elevated levels of DIF for one or more focal groups 
(Scheuneman 1987; Dorans, Schmitt, and Curley 1988; 
Scheuneman and Briel 1988; Schmitt, Curley, Bleistein, 
and Dorans 1988; Bleistein, Schmitt, and Curley 1990). 



This latter group of studies has generally used items written
specifically for the study or pretested items on which
no DIF data were yet available; these items have been 
administered in nonoperational sections that did not 
count as part of the examinees' scores. 

Although findings from the randomized studies have 
clarified some factors previously hypothesized to be re- 
lated to DIF, they have also shown that elevated levels of 
DIF cannot be completely eliminated at the item-writing 
stage because the factors are confounded or as yet un- 
identified. The mere flagging of an item for DIF does not 
indicate the reason(s) for the differential functioning. 
Thus it is likely that some SAT items with elevated levels 
of DIF will continue to be found at the pretest stage even 
if test developers were provided with item-writing guide- 
lines and/or if changes were made in test specifications 
to reduce DIF. Of the approximately 2,250 SAT-Verbal 
(SAT-V) questions pretested during 1990, 190 items 
(about 8.5 percent) exhibited moderate to large amounts 
of DIF. This total includes items differentially advantag- 
ing, as well as disadvantaging, focal groups, and includes 
questions from all four of the verbal item types (ant- 
onyms, analogies, sentence completions, and reading 
comprehension). 

Elevated levels of DIF in and of themselves do not 
prove that test questions are biased. Once an item is 
flagged for high DIF, judgment should be used to decide 
whether the difference in difficulty shown by the DIF in- 
dex is unfairly related to group membership. The deter- 
mination of fairness should be based on whether or not 
the difference in difficulty is judged to be relevant to the 
construct being measured by the test (Zieky 1991). For 
the purposes of this study, the authors selected pretested 
items with elevated levels of DIF that had not yet been 
evaluated with respect to whether or not the DIF was 
construct-relevant. The majority of the items studied con- 
tained specialized terminology likely to be found in the 
material read and viewed, or in the language used, or in 
the experiences engaged in, by one gender or minority 
group more often than by another because of their par- 
ticular interests or opportunities. 

The chief purposes of this study were twofold: (1) 
to revise individual verbal items on which elevated levels 
of DIF had been observed at the pretest stage to try to 
reduce or eliminate the DIF and thus make the items ap- 
propriate for use in operational forms of the SAT; and 
(2) to continue to evaluate and perhaps to supplement 
hypothesized DIF factors for the SAT by observing the 
effects of revisions on individual items previously exhib- 
iting DIF. 
Method

Data Source 

The data for this study were collected (1) initially in pretest
sections at various regular Saturday administrations
of the SAT and then, after the selected items were system- 
atically revised and reassembled into four 30-minute 
nonoperational sections, (2) at a regular 1991 SAT ad- 
ministration. At both stages of data collection the data 
consisted of unscored item responses to nonoperational 
questions from random samples of self-reported females, 
males, Asian Americans, blacks, Hispanics, and whites 
for whom English was (or English and another language 
were) their first language(s). Sample sizes for the analy- 
ses of the four newly assembled forms (labeled A to D) 
ranged from 4,331 white examinees for Form A to 183 
Hispanic examinees for both Forms C and D (see Table 
1). Data from the earlier, initial pretesting of the selected 
items were based on groups of examinees that were 
roughly proportional in sample size to the groups re- 
ported in Table 1. 

Instrument and Design 

Based on initial pretest DIF data and/or hypotheses pro- 
posed in previous research (see the Appendix), a total of 
7 sentence completion and 16 analogy items were selected 
as the focus of investigation. These items were then modi- 
fied in ways intended to eliminate the factors hypoth- 
esized to be related to the elevated levels of DIF observed 
at the initial pretesting. Whenever possible, given the 
available "pool" of SAT items showing moderate to large 
amounts of DIF, sets of two or more items with the same 
(or similar) hypothesized DIF factors were included in 
this study for purposes of internal replication. The dif- 
ferent versions of each of the 23 items studied are dis- 
played in Tables 2 to 8 in terms of the various hypoth- 
esized DIF factors. 

Seven DIF factors were examined. Not all (or even most)
items in the following seven general categories consistently
show elevated levels of DIF, but certain patterns have been
detected. Science, industrial arts, and military
terminology, as well as contexts portraying aggression or
conflict, may negatively affect the performance of females,
based on what has been found in the evaluation of some SAT-V
items and/or in the research literature. Terminology of
special interest or familiarity to particular groups may
positively affect the performance of those groups. Cognates
with Spanish, especially when they appear in the stem or key
of SAT-V questions, may positively affect the performance of
Hispanic examinees. Homographs, especially when they appear
in the stem or key of SAT-V questions, may negatively affect
the performance of Hispanic, black, and Asian American
examinees. These are the seven DIF factors examined in the
present investigation.

TABLE 1

Sample Sizes and Difficulty Estimates (P%) for Study Groups
across Forms A, B, C, and D

                    Form A    Form B    Form C    Form D
WHITE
  N                  4,331     4,112     4,061     3,856
  Mean P%               56        58        60        56
  S.D. P%               26        23        22        25
HISPANIC
  N                    199       235       183       183
  Mean P%               45        50        49        49
  S.D. P%               24        21        21        24
BLACK
  N                    621       573       575       565
  Mean P%               39        41        43        41
  S.D. P%               23        21        21        21
ASIAN AMERICAN
  N                    259       270       237       226
  Mean P%               60        59        65        60
  S.D. P%               23        22        20        24
MALES
  N                  2,511     2,582     2,488     2,311
  Mean P%               55        56        59        55
  S.D. P%               25        23        22        24
FEMALES
  N                  3,028     2,746     2,689     2,625
  Mean P%               53        56        57        54
  S.D. P%               25        23        22        25

The original versions of the items studied (worded 
identically to the initial pretests) and as many as three 
different revised versions of each were assembled into 
four sections of the SAT for re-pretesting. Each section 
consisted of 40 verbal questions presented in the same 
order as that of the 40-item operational section of the 
SAT-V: items 1 to 10 were identical antonym questions 
across the four forms and were not part of this investigation;
items 11 to 15 were sentence completions; items 16
to 25 were analogies; and items 26 to 40 were reading 
comprehension questions and not part of this investiga- 
tion. Thus, Forms A, B, C, and D were indistinguishable 
from the operational sections of the SAT-V (as were the 
earlier verbal pretests from which the initial DIF data 
were derived). 

Original and revised versions of items were kept in 
the same position across either two or four of Forms A, 
B, C, and D. For example, item 11 in Forms A, B, C, and 
D presented four different versions of the same sentence 
completion; item 12 in Forms A and B presented two dif- 
ferent versions of a second sentence completion; item 12 
in Forms C and D presented two different versions of a 
third sentence completion; and so on. Each of the four 
forms was constructed in such a way that it would not 
violate any of the usual SAT-V pretest assembly guide- 
lines. To the extent possible, the difficulty of the alter- 
nate versions of the items was kept parallel; however, 
insofar as word substitutions were based on the subjec- 
tive judgments of the authors, alternate versions of some 
items were found to differ in difficulty. 

Variables such as order of answer choices (A to E), 
key position, and content classification (unless it was as- 
sociated with the hypothesis being evaluated) were held 
constant for the alternate versions of each item studied. 
The factors hypothesized as causes of the elevated DIF as 
well as the various groups differentially affected by the 
items studied were also carefully balanced across the four 
forms so that no one section of re-pretested items would 
include a preponderance of questions likely to affect any
particular group either negatively or positively. 

In addition to reviews by the authors, each item stud- 
ied and its alternate version(s) were also reviewed by two 
test development colleagues familiar with the SAT-V and 
with relevant DIF research. After the pretested items were 
assembled into the four sections for this study, each of 
the variants passed through routine test specialist, edit- 
ing, sensitivity, and planograph reviews. This review pro- 
cess assured that the four nonoperational forms from 
which data for this investigation were derived were com- 
parable to regular operational and pretest sections of the 
SAT-V. 

Procedure 

Items that are more difficult for one group than for another
with the same level of ability or skill are defined as 
differentially more difficult or as functioning differen- 
tially between the two groups. Usually the white or male 
group is referred to as the reference or base group and 
the minority or female group as the focal or study group. 
Since DIF indices take into account overall differences 
in ability on the construct being measured by matching 
the groups before comparing their performance, DIF in- 
dices identify items that might have construct-irrelevant 
characteristics. 

Two statistical procedures currently used at ETS to 
assess DIF are the Mantel-Haenszel (MH) method (Holland
and Thayer 1988) and the Standardization (DSTD) 
method (Dorans and Kulick 1983, 1986). Both of these 
methods identify DIF after partitioning the reference and 
focal groups into subgroups with the same score on a 
relevant matching variable. The matching variable is usu- 
ally the total score on a test closely related to the con- 
struct that the item is intended to measure. While there 
are some minor differences between the MH and DSTD 
methods (Dorans and Holland 1992), the DIF estimates 
computed by these methods are highly correlated (in the 
upper .90s) because they tend to yield the same rank or- 
der of items with respect to DIF (Wright 1987; Holland 
and Thayer 1988; Dorans 1989). Both the DSTD and 
MH indices take into account speededness in the calcu- 
lations of DIF by including only those examinees who 
reached an item in the calculation of the DIF value for 
that item (Schmitt and Bleistein 1987). 

Standardization Procedure 

In the traditional Standardization analysis, an item is said 
to exhibit differential item functioning when the prob- 
ability of correctly answering the item is lower or higher 
for examinees from one group than for equally able ex- 
aminees from another group. The focus of DIF analyses 
is on differences in performance between groups that are 
matched with respect to the ability, knowledge, or skill 
of interest. 

The basic elements of a Standardization analysis of 
the keyed response are proportions correct at each level 
of a matching variable, such as total score, in a base or 
reference group and a focal or study group. Standardiza- 
tion provides the DSTD index for quantifying DIF in the 
p metric. This index can range from -1 to +1, or from 
-100 percent to 100 percent. Negative values of DSTD 
indicate that the item disadvantages the focal group, 
while positive values indicate that the item favors the 
focal group. STD P-DIF values between -.05 (-5 percent)
and +.05 (+5 percent) are considered negligible. STD P-DIF
values outside the -.10 and +.10 (or the -10 percent,
+10 percent) range are considered sizable. For operational
purposes, |DSTD| > .10 is a recommended cutoff;
for exploratory research purposes, a less reliable cutoff
of |DSTD| > .05 is often used. In addition to calculating
DSTD values for the key, differences in the standardized
proportion of responses for each distractor are also
computed and studied to understand better the effects
of the hypothesized DIF factors (see Dorans, Schmitt,
and Bleistein, 1992, for a description of distractor
analyses).
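The DSTD index described above can be sketched in a few lines. This is a minimal illustration, not the ETS implementation: it assumes the score-level differences in proportion correct are weighted by the number of focal group examinees at each matched score level, and the function names, argument layout, and the "intermediate" label for values between the two cutoffs are hypothetical.

```python
def std_p_dif(focal_n, focal_correct, ref_n, ref_correct):
    """Standardized p-difference (DSTD), focal minus reference.

    Each argument is a sequence indexed by matched score level.
    Assumption: levels are weighted by the focal group count.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for nf, cf, nr, cr in zip(focal_n, focal_correct, ref_n, ref_correct):
        if nf == 0 or nr == 0:
            continue  # no matched comparison possible at this level
        p_focal = cf / nf  # proportion correct, focal group
        p_ref = cr / nr    # proportion correct, reference group
        weighted_sum += nf * (p_focal - p_ref)
        total_weight += nf
    if total_weight == 0:
        raise ValueError("no score level contains both groups")
    return weighted_sum / total_weight


def dstd_label(dstd):
    """Apply the cutoffs quoted in the text to a DSTD value."""
    size = abs(dstd)
    if size <= 0.05:
        return "negligible"
    if size > 0.10:
        return "sizable"
    return "intermediate"  # hypothetical label for the middle band


# Two score levels; the focal group is 10 points lower at each.
value = std_p_dif([10, 20], [4, 14], [100, 100], [50, 80])
print(round(value, 3), dstd_label(-0.12))
```

Negative output means the item disadvantages the focal group, matching the sign convention in the text.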

Mantel-Haenszel Method 

The Mantel-Haenszel procedure (Mantel and Haenszel 
1959), adapted by Holland and Thayer (1988) for DIF 
analysis, computes ratios of the conditional odds of suc- 
cessful reference group performance over the conditional 
odds of successful focal group performance at each score 
level, and then averages these ratios across score levels. 
In the calculation of the average ratio, statistically opti- 
mal weights are used for each ratio. The Mantel-Haenszel 
method provides an estimate of the constant odds-ratio. 
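As a rough sketch of the weighted averaging just described, the standard Mantel-Haenszel estimator of the constant odds ratio can be computed from one 2 x 2 table per matched score level. The function name and the tuple layout (reference right, reference wrong, focal right, focal wrong) are assumptions made for illustration.

```python
def mh_odds_ratio(tables):
    """Mantel-Haenszel estimate of the constant odds ratio.

    tables: iterable of (ref_right, ref_wrong, focal_right, focal_wrong),
    one tuple per matched score level.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in tables:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # empty score level contributes nothing
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these tables")
    return numerator / denominator


# Two score levels in which the reference group has the higher odds.
alpha = mh_odds_ratio([(40, 10, 30, 20), (30, 20, 20, 30)])
print(alpha)
```

A value above 1 means the odds of success are higher for the reference group than for matched focal group examinees.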

The MH statistic is transformed to the "delta" met- 
ric used to indicate item difficulty in the ETS test devel- 
opment process. To obtain a delta, the proportion cor- 
rect (p) is converted to a z-score via a p-to-z transforma- 
tion using the inverse of the normal cumulative function, 
followed by a linear transformation to a metric with a 
mean of 13 and a standard deviation of 4. Large values 
in a delta metric correspond to difficult items, while easy 
items have small delta values. This MH estimate of DIF 
effect size in the delta metric ranges from negative infin- 
ity to infinity, with a value of 0 indicating no DIF. 
MH D-DIF values between -1.00 and +1.00 are considered
negligible. MH D-DIF values outside the -1.50,
+1.50 range are considered sizable. For operational
purposes, |MH D-DIF| > 1.50 is a recommended cutoff; a less
reliable cutoff of |MH D-DIF| > 1.00 is often used for
exploratory research purposes. As with DSTD, positive
values of MH D-DIF favor the focal group, while nega- 
tive values disadvantage the focal group. For a complete 
description and comparison of the DSTD and MH D-DIF 
statistics, refer to Dorans and Holland (1992). 
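The two transformations in this paragraph can be illustrated as follows. This is a sketch under stated assumptions: the sign of the linear step (13 minus 4z) is inferred from the statement that large deltas correspond to difficult items, and the -2.35 multiplier is the commonly cited Holland and Thayer conversion of the MH odds ratio to the delta metric, which this text does not state explicitly. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def p_to_delta(p):
    """Convert a proportion correct (0 < p < 1) to a delta value.

    Assumption: delta = 13 - 4 * z, with z the normal deviate for p,
    so harder items (small p) receive larger deltas.
    """
    z = NormalDist().inv_cdf(p)  # raises an error at p = 0 or p = 1
    return 13.0 - 4.0 * z


def mh_d_dif(alpha_mh):
    """Express an MH odds ratio in the delta metric.

    Assumption: MH D-DIF = -2.35 * ln(alpha_MH).
    """
    return -2.35 * math.log(alpha_mh)


print(p_to_delta(0.5))            # an item of middle difficulty
print(round(mh_d_dif(2.0), 2))    # reference group odds twice as high
```

Because the logarithm grows without bound as the odds ratio approaches zero or infinity, the resulting index has no finite limits, which is the property the abstract refers to for very easy and very difficult items.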

In the present investigation, categorization of DIF 
items was made on the basis of the standard ETS DIF 
operational item screening classifications (Petersen 
1988). These classifications are as follows: 

(1) "A" items have a MH D-DIF not significantly 
different from 0 (at the .05 level) or an absolute value less 
than 1.00; (2) "B" items have a MH D-DIF significantly 
different from 0 (at the .05 level) and either an absolute 
value of at least 1.00 but less than 1.50 or an absolute 
value of at least 1.00 but not significantly greater than 
1.00 (at the .05 level); (3) "C" items have an absolute 
value of MH D-DIF of at least 1.50 and significantly 
greater than 1.00 (at the .05 level). 
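The A, B, and C rules above can be written as a small decision function. This is a sketch only: it assumes normal-theory tests based on a supplied standard error, a two-sided test against 0, and a one-sided test against 1.00, none of which is specified in the text. The function name and the default critical values are hypothetical.

```python
def ets_dif_category(mh_d_dif, standard_error,
                     z_two_sided=1.96, z_one_sided=1.645):
    """Classify an item as "A", "B", or "C" from its MH D-DIF value."""
    size = abs(mh_d_dif)
    # Significantly different from 0 at the .05 level (assumed two-sided).
    differs_from_zero = size / standard_error > z_two_sided
    # Significantly greater than 1.00 at the .05 level (assumed one-sided).
    exceeds_one = (size - 1.0) / standard_error > z_one_sided

    if not differs_from_zero or size < 1.0:
        return "A"
    if size >= 1.5 and exceeds_one:
        return "C"
    return "B"


# Illustrative standard errors: small for a large-sample comparison,
# larger for a small focal group.
print(ets_dif_category(-2.15, 0.15))  # large, clearly beyond 1.00
print(ets_dif_category(-0.56, 0.50))  # small in absolute value
print(ets_dif_category(-1.42, 0.15))  # between 1.00 and 1.50
```

The sign of the value is kept separately from the category, so a "C" item may be a negative or a positive "C".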

Matching Criteria 

The analysis of differential item functioning involves a 
two-step process to refine the matching criteria. During 
the first step, the total-test raw score on the SAT-V op- 
erational 85-item test is used as the matching criterion to 
determine DIF for each of the 85 items. On the basis of 
the initial analysis, any item with extreme DIF values for 
the corresponding focal group comparison is removed as 
part of the total score used to match the reference and 
focal groups. Thus, a "refined" matching criterion is de- 
termined for each focal group comparison for use in the 
subsequent pretest DIF analyses. In this study, two items 
were identified as having extreme DIF (one for both black 
and Asian American examinees and one for only Asian 



American examinees) and were, therefore, deleted from 
the total score matching criterion for the respective focal 
group analyses. Thus two refined matching criteria were 
created: (1) for the white and Asian American examin- 
ees: SAT-V = 83 (85 items minus 2 items) and (2) for the 
white and black examinees: SAT-V = 84 (85 items minus 
1 item). For the other focal groups (i.e., Hispanic and fe- 
male examinees), the matching criterion was the total 
score on the 85-item SAT-V operational test. 
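The refinement step can be pictured as recomputing each examinee's matching score without the items removed for a given focal group comparison. The sketch below assumes a simple number-right score over 0/1 item responses, which may differ from the raw score actually used; the function name and the item positions are hypothetical.

```python
def refined_matching_score(item_scores, excluded_items):
    """Total score over all items except those removed for extreme DIF.

    item_scores: sequence of 0/1 scores, one per operational item.
    excluded_items: set of zero-based item positions to leave out.
    """
    return sum(score for position, score in enumerate(item_scores)
               if position not in excluded_items)


# An examinee who answered all 85 operational items correctly.
responses = [1] * 85

# Hypothetical positions of the removed items.
asian_american_white = refined_matching_score(responses, {3, 40})  # 83 items
black_white = refined_matching_score(responses, {3})               # 84 items
hispanic_white = refined_matching_score(responses, set())          # 85 items
print(asian_american_white, black_white, hispanic_white)
```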

Results and Discussion 

Difficulty estimates for the 40-item special pretest sec- 
tions and sample sizes for each group studied are pre- 
sented in Table 1. Spiraling of the forms randomized their 
presentation, and no differences in the ability of the 
groups across the samples taking the four forms were 
expected or observed from mean verbal scores; the 
groups were judged to be essentially parallel across all 
four forms. Difficulty estimates for the four forms indi- 
cated that, for the most part, the four forms were paral- 
lel in difficulty. Form A appeared slightly more difficult 
and Form C slightly easier for most groups studied but, 
in general, there was a close correspondence among 
means and standard deviations for the difficulty estimates 
across all four forms. 

Scatterplots of difficulty and discrimination indices 
(p-values and R-Biserials) are presented in Figures 1 to 4. 
Figures 1 and 3 present the p-values and Figures 2 and 4 
the R-Biserials between Forms A and B and Forms C and 
D, respectively. These figures show that for those items 
where the indices follow the diagonal line, the difficulty 
and discrimination indices remained parallel across 
forms. More outliers are noted on the difficulty plots than 
on the discrimination plots. There are about six items 
per pair of forms with difficulty differences greater than 
15 percent. Most of these items were revised by chang- 
ing words that differed in difficulty, thus affecting the 
difficulty of the total item. This shift in difficulty was 
not totally unexpected because, although an effort was 
made to maintain the relative difficulty of parallel items, 
previous studies have shown that changes of one word 
can alter the difficulty of the item (Schmitt, Curley, 
Bleistein, and Dorans 1988; Bleistein, Schmitt, and 
Curley 1990). 

Tables 2 to 8 present the statistical results associated
with all of the different versions of the items studied in
this investigation. The tables are organized with the
original reprinted version of each question always appearing
first (in the left column), regardless of which of the four
forms that version may have appeared in. Look, for instance,
at Table 2, item 11. The wording of the question as it was
initially pretested and as it was reprinted for this study
(with a key of "curb..predators") appears first, along with
the new data. The data indicate that the item was classified
as negative "C" for female examinees using the MH metric
(-2.15); for Hispanic, black, and Asian American examinees,
the item was classified "A." This version of the question
appeared in Form A and was answered correctly by 73 percent
of the total population; the R-Biserial of the item was .69.
To the left of each of the five options (A to E) are found
the standardized differences between matched groups of
examinees (focal minus reference); the standardized
differences for those who omitted each item are also
included. For example, in the version of item 11 that
appeared in Form A, 12 percent fewer females than matched
males selected the key (E), and 11 percent more females than
matched males selected distractor (A).

Figure 1. Scatterplot of difficulty estimates between pairs
of items: Forms A and B.

Figure 2. Scatterplot of discrimination estimates between
pairs of items: Forms A and B.

Figure 3. Scatterplot of difficulty estimates between pairs
of items: Forms C and D.

Figure 4. Scatterplot of discrimination estimates between
pairs of items: Forms C and D.

To the right of this first version of item 11 is a sec- 
ond version with revisions indicated in boldface (in the 
key, "curb" has been changed to "lessen" in Form B). In 
many cases, only one revision was made to the original 
item, but for this item (as for several others) there were 
two additional revisions (in two additional forms) that 
appear directly below the first two versions of item 11: 
in Form C the key was changed to "curb..enemies," and 
in Form D the key was changed to "lessen..enemies." 
Tables 2 to 8 present the items in this study classified by 
the hypothesized DIF factors. 



TABLE 2

Effects of Science Terminology

Item 11. In order to ----- the health hazard caused by an
increased pigeon population, officials have added to the
area's number of peregrine falcons, natural ----- of pigeons.

Form A (original)   P = .73   R-Bis = .69
                                   F        H        B        A
MH D-DIF                       -2.15    -0.56    -0.78    -0.58
ETS DIF Category                   C        A        A        A
DSTD-P%
  (A) reduce..allies              11        4        6        0
  (B) promote..rivals              1        1        0        7
  (C) starve..prey                 0        0        0        0
  (D) counter..protectors          0       -2        0        1
 *(E) curb..predators            -12       -4       -6       -3
  (OMITS)                          0        1        1        0

Form B (revised)   P = .85   R-Bis = .59
                                   F        H        B        A
MH D-DIF                       -1.42    -0.21    -1.04     0.38
ETS DIF Category                   B        A        B        A
DSTD-P%
  (A) reduce..allies               6        0        8        0
  (B) promote..rivals              1       -1       -1        0
  (C) starve..prey                 0        1        0        1
  (D) counter..protectors          0        0        0       -1
 *(E) lessen..predators           -6       -1       -8        0
  (OMITS)                          0        1        1        0

Form C (revised)   P = .70   R-Bis = .68
                                   F        H        B        A
MH D-DIF                       -1.94    -1.12    -0.56    -0.50
ETS DIF Category                   C        B        A        A
DSTD-P%
  (A) reduce..allies              10        6        5 [illeg.]
  (B) promote..rivals              1 [illeg.]        1        2
  (C) starve..prey                -1        3       -1        0
  (D) counter..protectors          1       -1        0       -1
 *(E) curb..enemies              -12       -8       -4       -3
  (OMITS)                          0       -1        0        0

Form D (revised)   P = .80   R-Bis = .62
                                   F        H        B        A
MH D-DIF                       -1.79    -0.15    -0.92    -0.20
ETS DIF Category                   C        A        A        A
DSTD-P%
  (A) reduce..allies               9        2        4        0
  (B) promote..rivals              0        0        1        1
  (C) starve..prey                 0        0        2        0
  (D) counter..protectors          0        0        1        0
 *(E) lessen..enemies             -9       -1       -8       -1
  (OMITS)                          0       -1        0        0

Item 12. Although cacti are not ----- Hawaii, several species
have been introduced there and are flourishing.

Form B (original)   P = .37   R-Bis = .68
                                   F        H        B        A
MH D-DIF                       -1.75    -0.04    -0.94    -0.83
ETS DIF Category                   C        A        A        A
DSTD-P%
  (A) adaptable to                10        6        7        8
 *(B) indigenous to              -13       -1       -7       -6
  (C) excluded from                0        0        2        0
  (D) compatible with              3       -5       -2       -2
  (E) limited to                   0        0        0       -1
  (OMITS)                          0        0        0        0

Form A (revised)   P = .88   R-Bis = .57
                                   F        H        B        A
MH D-DIF                       -1.30    -1.18    -1.09    -1.91
ETS DIF Category                   B        B        B        C
DSTD-P%
  (A) adaptable to                 4 [illeg.]        6        3
 *(B) native to                   -5       -7       -7       -6
  (C) excluded from                0        0        2        2
  (D) compatible with              1        0        0        6
  (E) limited to                   0        1       -1        0
  (OMITS)                          0        0        1        0

Item 21. CHIMPANZEE:PRIMATE::

Form A (original)   P = .41   R-Bis = [illeg.]
                                   F        H        B        A
MH D-DIF                       -2.27    -0.33    -0.60     0.53
ETS DIF Category                   C        A        A        A
DSTD-P%
  (A) baboon:gorilla               0        4        3       -1
  (B) cat:kitten                   1        2        5        1
  (C) cocoon:larva                 2        0        0       -3
 *(D) squirrel:rodent            -20       -2       -5 [illeg.]
  (E) fish:amphibian              15       -1       -4       -1
  (OMITS)                          2 [illeg.]        1        1

Form B (revised)   P = .67   R-Bis = .50
                                   F        H        B        A
MH D-DIF                       -1.01    -0.23    -0.38    -0.32
ETS DIF Category                   B        A        A        A
DSTD-P%
  (A) baboon:gorilla               1        1       -1        1
  (B) cat:kitten                   1        4        2        0
  (C) cocoon:larva                 4       -2       -1        1
 *(D) squirrel:rodent             -8       -2       -3       -3
  (E) fish:gill                    1        1        0       -1
  (OMITS)                          2       -2        2        1

Form D (revised)   P = .48   R-Bis = .46
                                   F        H        B        A
MH D-DIF                       -1.51    -0.23    -0.50    -0.11
ETS DIF Category                   C        A        A        A
DSTD-P%
  (A) baboon:gorilla              -1       -1        1        1
  (B) cat:kitten                   0        1        5        1
  (C) cocoon:larva                 0        3        1        1
 *(D) mouse:rodent               -14       -2       -5       -1
  (E) fish:amphibian              12        2       -4        0
  (OMITS)                          2       -3        3       -2

Form C (revised)   P = .76   R-Bis = .54
                                   F        H        B        A
MH D-DIF                       -0.79    -0.22    -0.81     0.84
ETS DIF Category                   A        A        A        A
DSTD-P%
  (A) baboon:gorilla               0        1        1        0
  (B) cat:kitten                   2        0        6       -1
  (C) cocoon:larva                 1        4        1       -2
 *(D) frog:amphibian              -5       -2       -7        4
  (E) squirrel:reptile             0        1        0       -1
  (OMITS)                          2 [illeg.]        0        0

Item 24.

Form C (original)   VORTEX:WATER::   P = .34   R-Bis = .45
                                   F        H        B        A
MH D-DIF                       -2.28     0.87     0.17     0.67
ETS DIF Category                   C        A        A        A
DSTD-P%
  (A) volcano:crust                1       -2 [illeg.]        1
  (B) river:delta                  2       -1 [illeg.]       -2
 *(C) tornado:air                -19        7        1        6
  (D) geyser:steam                 3        2        2        3
  (E) earthquake:fault            -1        2        1       -3
  (OMITS)                         14 [illeg.]       -8       -4

Form [illeg.] (revised)   WHIRLPOOL:WATER::   P = .76   R-Bis = .37
                                   F        H        B        A
MH D-DIF                       -0.83    -0.81    -0.58     0.38
ETS DIF Category                   A        A        A        A
DSTD-P%
  (A) volcano:crust                1        1        1        0
  (B) river:delta                  1        0        3        0
 *(C) tornado:air                 -6       -7       -5        2
  (D) geyser:steam                 3        4 [illeg.]       -2
  (E) earthquake:fault             1        1        1        0
  (OMITS)                          0        1        2        0

MH D-DIF: Mantel-Haenszel Index of Delta Differences (focal minus reference)

ETS DIF Category: A represents negligible DIF, B represents slight to moderate DIF, and C represents moderate to large DIF.

DSTD-P%: Standardization Index of Proportion Correct Differences (focal minus reference)

F: matched female/male comparison    H: matched Hispanic/white comparison

B: matched black/white comparison    A: matched Asian American/white comparison

*Indicates correct answer.

Item revisions are indicated by boldface.






Two Initial Observations 

Before turning to the results of this investigation that 
speak to the primary purposes of the study, two related 
observations based on the data warrant some initial con- 
sideration. First is the issue of variation in DIF data for 
identically repeated items between initial pretesting and 
subsequent reprinting for this study. Of course a certain 
amount of "noise" should always be expected in such 
data simply because of differences in the samples and the 
contexts (i.e., surrounding items) in which the repeated 
items appear. It should not be surprising, for example, to 
see more variation in the DIF data of identically reprinted 
items for minority examinees than for female examinees, 
given that minority sample sizes are generally much 
smaller than other sample sizes. (In fact, the standard 
error of the MH delta statistic is about .50 for minority 
groups on SAT-V, while for the male/female comparison 
the standard error is about .15.) Of the 23 items reprinted 
for this study, there were 6 for which the ETS DIF cat- 
egory shifted from "C" to "B" or "A" for identically re- 
printed items. 



Table 9 shows MH D-DIF values for the six items for 
which such shifts in the ETS DIF categories occurred. 
There were also some items with "B" to "A" or "A" to 
"B" shifts for some groups, but these were not included 
in the table because they are not relevant to this study. 
All but two of the six items (Form C, item 13, and Form 
D, item 25) showed shifts in values that were within two 
standard errors of the MH D-DIF statistic. (Differences 
of more than two standard errors are expected to occur 
less than 5 percent of the time by chance alone.) Note that 
two of the identically reprinted items (Form D, items 16 
and 23) each shifted two categories for one of the focal 
groups — from "C" at initial pretesting to "A" in this 
study — even though both shifts were within sampling 
error. Note also that another of the items (Form D, item 
25) actually shifted two categories and changed sign, 
from a positive "C" for Hispanics at initial pretesting to 
a negative "A" for Hispanics in this study; this shift in 
value was beyond that which would be expected given 
sampling error. 
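
The two-standard-error screen described above can be sketched in a few lines of Python. This is only an illustration: the function name, the example MH D-DIF values, and the choice to compare the raw shift against twice the quoted standard error are assumptions made here; only the approximate standard errors (.50 for minority groups, .15 for the male/female comparison) come from the text.

```python
# Approximate standard errors of the MH D-DIF statistic quoted in the text.
SE_MINORITY = 0.50
SE_FEMALE = 0.15


def shift_within_sampling_error(mh_initial, mh_repeat, se, k=2.0):
    """Return True if the change in MH D-DIF between two administrations
    of an identical item is no larger than k standard errors."""
    return abs(mh_repeat - mh_initial) <= k * se


# Hypothetical values for illustration only (not taken from Table 9).
print(shift_within_sampling_error(-1.60, -0.90, SE_MINORITY))  # shift of 0.70
print(shift_within_sampling_error(1.60, -0.40, SE_MINORITY))   # shift of 2.00
```

Under this sketch, the same shift of 0.70 would be within sampling error for a minority comparison but outside it for the male/female comparison, because the male/female standard error is much smaller.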

The second general observation from these data not
directly related to the primary purposes of the study in-



TABLE 3



Effects of Industrial Arts Terminology 









Item
No.

(Form)
P
R-Bis

MH D-DIF
ETS DIF Category
DSTD-P%

F   H   B   A   Item Text

(Form)
P
R-Bis

MH D-DIF
ETS DIF Category
DSTD-P%

F   H   B   A   Item Text


19. 


(B) 


-3.25 


-2.03 


-1.49 


-0.86 


RIVET:METAL::




(A) 


0.37 


-0.99 


-0.98 


-0.H 


PIN:CLOTH::




.55 


C 


C 


B 


A 






.85 


A 


A 


A 


A 


(A) needle:thimble




.45 


3 


3 


2 


4 


(A) needle:thimble




.43 


0 


2 


4 


0 






1 


2 


-T 


_2 


(B) cork:bottle






-1 


3 


0 


-1 


(B) cork:bottle






5 


8 


4 


2 


(C) nail:hammer






0 


0 


1 


0 


(C) nail:hammer






-28 


-18 


-13 


-8 


*(D) staple:paper






1 


-6 




0 


*(D) staple:paper






3 


1 


2 


_2 


(E) rope:swing






-1 


1 


1 


0 


(E) rope:swing






16 


4 


7 


5 


(OMITS) 






0 


0 


0 


0 


(OMITS)


23. 


(D) 


-1.93 


-0.72 


-0.83 


-0.84 


BIT:DRILL:: 




(C) 


-1.50


-0.90


-1.24 


-0.66


BIT:DRILL::




J 3 


C 


A 


A 


A 








B 


A 


B 


A 






.54 


0 


-5 


-1 


3 


(A) wax:crayon




.51 


1 


0 


0 


0 


(A) wax:crayon






-1 


1 


2 


2 


(B) impression:stylus






1 


1 




i 


(B) impression:stylus






9 


-1 


_» 


„2 


(C) handle:brush






s 


-1 


_ i 


_■> 


(C) handle:brush






-15 


-5 


-5 




*(D) point:awl






-11 


-X 


-11 


-5 


*(D) point:spear








_1 


0 


._■» 


(E) needle:thread






!) 


4 


» 


-4 


(E) needle:thread






9 


12 


5 


7 


(OMITS) 








? 


8 




(OMITS)


23. 


(A) 


-1.96 


-:.i\> 


-1.6^ 


-0.21 


PRONGS:PITCHFORK:: 




(B)


-1.0? 


-2.25


-1.18


-1.42


PRONGS:PITCHFORK::




.44 


C 


C


C


A 






"9 


B 


I 


li 


I 






.58 


2 


0 


l 




(A) wax:crayon






■> 


■> 


-1 


1 


(A) wax:crayon






0 


1 


0 


? 


(B) impression:stylus






-1 


i 


1 


i 


(B) impression:stylus






n 


? 


0 




(C) handle:brush






s 


"\ 


- 1 


4 


(C) handle:brush






-16 


-16 


-1 1 


-1 


*(D) point:awl






-6 


-r 


-10 


-15 


*(D) point:spear






0 


6 


1 


0 


(E) needle:thread






0 




1 


i 


(E) needle:thread






1 


5 


6 


7 


(OMITS) 






1 




s 




(OMITS)



MH D-DIF: Mantel-Haenszel Index of Delta Differences (focal minus reference)

ETS DIF Category: A represents negligible DIF, B represents slight to moderate DIF, and C represents moderate to large DIF.
DSTD-P%: Standardization Index of Proportion Correct Differences (focal minus reference)
F: matched female/male comparison H: matched Hispanic/white comparison
B: matched black/white comparison A: matched Asian American/white comparison



*Indicates correct answer.
Item revisions are indicated by boldface.



TABLE 4



Effects of Military Terminology 



Item 

No. 


(Form)

P
R-Bis


MH D-DIF
ETS DIF Category
DSTD-P%






(Form) 

P 
R-Bis 


MH D-DIF
ETS DIF Category 
DSTD-P% 




F 


H 


B 


A 


Item Text 




F 


H 


B


A 


Item Text 


17. 


(C) 
.78 
.47 


-3.69 
C 
-21 
4 
4 
5 
3 
6 


-1.09 
B 
-9 
S 
3 
0 
2 

0 


-1.23 
B 
-10 
5 
4 
2 
-1 
1 


-0.81 
A 
-4 
1 
1 
-1 
-1 
4 


CONVOY:SHIPS:: 

*(A) flock:birds

(B) ferry:passengers 

(C) barn:horses

(D) dealership:cars 

(E) highway:trucks
(OMITS) 




(D)
.73 
.60 


0.56 
A 
4 
0 
-2 
0 
-1 
-1 


0.16 
A 
1 
2 

-1 
0 

-1 
0 


0.21 
A 
2 
4 
0 
-2 
-2 
-3 


0.39 
A 
2 
2 
0 
-2 
-1 
-1 


TROUPE:DANCERS:: 

*(A) flock:birds

(B) ferry:passengers 

(C) barn:horses 

(D) dealership:cars 

(E) highway:trucks
(OMITS) 


18. 


(A) 
.76 
.33 


-2.84 
C 
1 
5 
3 
1 

-19 

9 


-0.40 
A 
1 
1 
2 
1 

-4 
-1 


-0.37 
A 
0 
0 
2 
1 

-2 
-1 


0.41 
A 
1 

-3 
-1 
0 
3 
0 


DETONATE:EXPLOSION::

(A) collide:momentum 

(B) decipher:code 

(C) energize:stimulant 

(D) strike:ore 
*(E) ignite:fire 

(OMITS) 




(B) 
.72 
.44 


-0.32 
A 
0 
1 
1 
0 
-3 
0 


-0.64 
A 
1 
0 
5 
1 

-6 
-1 


-0.63 
A 
1 
1 
5 
0 
-7 
1 


-0.41 
A 
1 

-1 
3 
0 

-3 
0 


PROVOKE:REACTION:: 

(A) collide:momentum 

(B) decipher:code 

(C) energize:stimulant 

(D) strike:ore 
*(E) ignite:fire 

(OMITS) 


19. 


(D) 
.65 
.50 


-2.75 
C 
3 
5 

-21

0 
8 
5 


-0.02 
A 
-1 
3 
0 
1 
-1 
-3 


-0.83 
A 
3 
4 

-') 
1 

-1 
3 


-0.52 
A 
2 
3 

-4 
0 
0 

-1 


AMMUNITION:CARTRIDGE
BELT::

(A) rifle:trigger 

(B) dart:spear 
*(C) arrow:quiver 

(D) golf:course 

(E) football:goalpost 




(C) 
.59 
.50 


-2.34 
C 
1 
3 

-20 
2 
12 
1 


-1.14 
B 
1 

5 

-10 
5 
-2 
1 


-0.92 
A 
0 
5 
-9 
2 
1 
2 


-0.36 
A 
-2
1 

-3 
2 
2 
0 


MONEY:WALLET:: 

(A) rifle:trigger 

(B) dart:spear 
*(C) arrow:quiver 

(D) golf:course 

(E) football:goalpost
(OMITS) 


20. 


(C) 
.50 
.58 


-2.32 
C 
1 
1 
3 

-19 
6 
9 


-0.17 
A 

-1 
1 

-1 
2 
4 

-5 


-0.46 
A 
2 

-1 
4 

-4 
1 

-1 


0.41 
A 
-2 
-i 
-3 
3 
-1 
3 


MUTINY:CAPTAIN:: 

(A) theft:police 

(B) riot:crowd 

(C) plagiarism:author
*(D) strike:employer

(E) war:general 
(OMITS) 




(D) 
.51 

.55 


-2.00 
C 
3 
2 
3 

-17
1 

8 


0.33 
A 
2 
2 
-3 
3 
1 

-4 


-0.08 
A 
2 
-1 
-1 
-1 
4 
-2 


0.61 
A 

-3 
0 

-4 
S 
1 
1 


MUTINY:CAPTAIN:: 

(A) theft:police 

(B) riot:crowd 

(C) plagiarism:author 
*(D) strike:employer

(E) recipe:chef 
(OMITS) 


20. 


(A) 
.61 

.58 


-0.73 
A 

-1 
1 
2 

-6 
2 
1 


-0.62 
A 
1 
2 

-2 
-5 
4 

0 


0.43 
A 
-7 
2 
0 
3 
1 
1 


0.64 
A 

-A 
0 
1 
4 

-2 
0 


REBELLION:AUTHORITY:: 

(A) theft:police

(B) riot:crowd 

(C) plagiarism:author 
*(D) strike:employer

(E) war:general 
(OMITS) 




(B) 
.62 
.54 


-0.84 
A 
4 
2 
1 

-7 

0 
0 


-0.72 
A 
3 
1 
2 
-6 
1 

-2 


0.18 
A 
-6 
0 
2 
0 
0 
4 


0.55 
A 

-3 
3 

-2 
3 
0 

-1 


REBELLION:AUTHORITY:: 

(A) theft:police 

(B) riot:crowd

(C) plagiarism:author 
*(D) strike:employer

(E) recipe:chef
(OMITS) 



volves the easiest and most difficult items in the SAT-V 
(extremes in the difficulty continuum). See, for instance, 
Table 8, item 12, Forms D and C. Since both versions 
were classified as negative "C" for females, it would ap- 
pear that the revision of this item (changing the key from 
"tapping" to "utilizing") did not succeed in eliminating 
the DIF. Yet the differences in proportion correct be- 
tween matched groups of males and females — which are 
reported using Standardization rather than Mantel- 
Haenszel — indicate a shift from -23 to -5. That is, the 
version of item 12 in Form C shows a small DSTD 
p-metric value (a 5 percent difference in performance be- 
tween matched males and females), yet (using the MH 
delta metric) it was still categorized as "C." Note, how- 
ever, that the revision to this sentence completion item 



changed it from a middle difficulty item (50 percent correct)
to a very easy item (better than 90 percent correct).
The same sort of phenomenon can be observed at the 
other end of the difficulty scale. See Table 4, item 25, 
Form A. This item was classified as negative "C" for fe- 
males (using the MH delta metric) despite the fact that 
there is only a 4 percent difference in performance on the 
item between matched groups of males and females (us- 
ing the DSTD p metric). Note again, however, the ex- 
treme overall difficulty of the item: only 7 percent of the 
total population answered this question correctly. Data 
on the revised version of the item in Form B (with stem 
and key switched) show that the ETS DIF category 
changed from "C" to "B" for females, yet the difference 
in performance between matched groups of males and 






TABLE 4 (continued)



Effects of Military Terminology 



Item 

No. 


(Form)

P
R-Bis


MH D-DIF
ETS DIF Category
DSTD-P%






(Form)

P
R-Bis


MH D-DIF
ETS DIF Category
DSTD-P%




F 


H 


B


A 


Item Text 




F 


H 


B 


A 


Item Text 


22. 


(B) 
.47 
.44 


-4.31 
C 
-37 
1 

25 
0 

10 
1 


-1.05 
B 
-9 
2 

-3 
2 
6 
2 


-1.22 
B 
-10 
2 
-5 
-1 
8 
6 


-1.07 
B 
-10 
2 
0 
-2 
6 
3 


COCKPIT:PILOT::

*(A) turret:gunner 

(B) somersault:acrobat 

(C) berth:sailor 

(D) baton:conductor 

(E) sidewalk:pedestrian 
(OMITS) 




(A) 
.59 
.53 


-3.31 
C 
-26 
1 
1 
1 

22 
1 


-1.01 
B 
-9 
1 
3 
1 
5 
-1 


-0.89 

A 
-8 

3 
-1 

3 
-1 

4 


-0.23 
A 
-2 
1 
0 
0 
-1 
2 


COCKPIT:PILOT::

*(A) turret:gunner

(B) somersault:acrobat

(C) uniform:fire fighter

(D) baton:drum major 

(E) sidewalk:pedestrian 

(OMITS)


22. 


(C) 
.89 
.57 


-0.54 
A 
-2 
0 
0 
0 
1 
1 


-1.99 
C 
-10 
3 
1 
1 
3 
2 


-1.92 
C 
-12 
1 
0 
2 
6 
4 


-2.15 
C 
-7 
1 
0 
0 
4 
2 


COCKPIT:PILOT::

*(A) booth:toll collector

(B) somersault:acrobat

(C) uniform:fire fighter

(D) baton:drum major 

(E) sidewalk:pedestrian 
(OMITS) 




(D) 
.79 
.58 


-0.07 
A 
0 
0 
-1 
0 
2 
0 


-0.99 
A 

-6 
3 

-1 
0 
5 

-1 


-0.62 
A 
-4 
3 
1 
1 
2 
-3 


-0.87 
A 
-4 
1 

0 
0 
3 
0 


STALL: VENDOR:: 

*(A) booth:toll collector 

(B) somersault:acrobat

(C) uniform:fire fighter 

(D) baton:drum major 

(E) sidewalk:pedestrian 
(OMITS) 


25.


(A) 
.07 
.44 


-1.71 
C 
0 
-1 
-4 
-12 
6 
11 


0.50 
A 

-2 
2 

-1 
9 
0 

-8 


0.75 
A 
3 
2 
i 

i 

-3 
-5 


0.45 
A 
0 
0 
2 
2 
-3 
-1 


MERCENARY:WARFARE::

(A) truant:school 

(B) thief:property 
*(C) hack:writing

(D) criminal:felony 

(E) defendant:accusation
(OMITS) 




(B) 
.14 

.22 


-1.29 
B 
1 

-4 
-6 
1 
1 

7 


-0.02 
A 
0 
-2 
0 
2 
6 
-6 


-0.61 
A 
1 
2 

-3 
4 
4 

-8 


-0.92 
A 
3 
-1 
-5 
2 
1 
-1 


HACK:WRITING::

(A) truant:school 

(B) thief:property 
*(C) mercenary:warfare 

(D) criminal:felony 

(E) defendant:accusation 
(OMITS) 



MH D-DIF: Mantel-Haenszel Index of Delta Differences (focal minus reference) 

ETS DIF Category: A represents negligible DIF, B represents slight to moderate DIF. and C represents moderate to large DIF. 

DSTD-P%: Standardization Index of Proportion Correct Differences (focal minus reference) 

F: matched female/male comparison H: matched Hispanic/white comparison 

B: matched black/white comparison A: matched Asian American/white comparison 

*Indicates correct answer.

Item revisions are indicated by boldface. 



females actually increased slightly (6 percent). On the 
revised version of this item, however, twice as many ex- 
aminees answered correctly overall (14 percent of the 
total population). 

A possible explanation for these apparent DIF 
anomalies among the very easy and difficult SAT-V items 
is related to the nature of the two different metrics being 
used. The Mantel-Haenszel index represents odds ratios 
converted to the ETS delta scale, a scale that is un- 
bounded at the two ends. The Standardization index, on 
the other hand, represents differences in proportions cor- 
rect on a scale (0 to 100) that is bounded at the top and 
bottom (Dorans and Holland 1992). Thus the two sets 
of statistics often behave differently with the easiest and 
most difficult SAT-V items. 
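
The contrast between the two metrics can be illustrated with a small Python sketch. It assumes the usual textbook definitions of the two indices, which this section does not spell out: MH D-DIF is taken as -2.35 times the natural log of the Mantel-Haenszel common odds ratio, and the standardized difference is the focal-weighted mean of the focal-minus-reference difference in proportion correct across matched score levels. The function names and all counts below are hypothetical, and a single score level is used only to keep the arithmetic easy to follow.

```python
import math


def mh_d_dif(strata):
    """MH D-DIF = -2.35 * ln(alpha_MH), pooled over matched score levels.
    Each stratum is (ref_right, ref_wrong, focal_right, focal_wrong)."""
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return -2.35 * math.log(num / den)


def std_p_dif_percent(strata):
    """Standardized p-difference (focal minus reference), in percent,
    weighting each score level by its focal-group count."""
    weighted = 0.0
    total = 0.0
    for a, b, c, d in strata:
        n_ref, n_focal = a + b, c + d
        weighted += n_focal * (c / n_focal - a / n_ref)
        total += n_focal
    return 100.0 * weighted / total


# Hypothetical counts: the same 5-point gap at two difficulty levels.
middle_item = [(525, 475, 475, 525)]  # roughly 50 percent correct
easy_item = [(950, 50, 900, 100)]     # better than 90 percent correct

for name, item in (("middle", middle_item), ("easy", easy_item)):
    print(name, round(mh_d_dif(item), 2), round(std_p_dif_percent(item), 1))
```

With these made-up counts both items show the same 5-point standardized difference, yet the MH D-DIF value of the very easy item is several times larger in magnitude, which is the kind of divergence described above for items at the extremes of difficulty.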

Success Rate in Reducing Differential 
Item Functioning 

One of the primary purposes of this study was to deter- 
mine how successfully SAT-V items with elevated levels 



of DIF — particularly those items for which the DIF 
seemed to be related to a factor under study — could be 
revised in order to reduce or eliminate differential func- 
tioning. In this way, it could be determined whether or 
not similar efforts at revision and re-pretesting would be 
worthwhile in the future. 

This investigation began with 23 items but, as dis- 
cussed above, some of them did not demonstrate C DIF 
after being reprinted and re-pretested for this study. Of 
the 18 items that actually fell into category "C" when 
they were reprinted, 12 items (67 percent) successfully 
shifted from C DIF to B DIF or A DIF after being pre- 
tested with the revisions. All 12 of the items that 
changed from C DIF showed reductions in MH D-DIF 
values outside those expected within sampling error. (All 
five items that did not fall into category "C" for any 
group when reprinted for this study showed small to 
moderate reductions in MH and/or DSTD values after the 
revisions.) Of the 12 items that shifted from C DIF, 9 of 
them shifted from negative C DIF and 3 from positive C 
DIF; since 15 of the 18 actual C DIF items were negative
and 3 of the 18 were positive, there was a 60 percent
success rate in eliminating negative C DIF (i.e., DIF not
favoring focal groups) and a 100 percent success rate in
eliminating positive C DIF (i.e., DIF favoring focal 
groups). 
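
The success rates quoted in this paragraph follow directly from the item counts. A minimal Python sketch of the arithmetic is given here; the variable and function names are illustrative only, and the counts are those stated in the text.

```python
def percent(successes, total):
    """Success rate as a whole-number percentage."""
    return round(100 * successes / total)


# Counts as reported in the text.
c_dif_items = 18                            # items that were "C" when reprinted
shifted_items = 12                          # items that moved from C to B or A
negative_total, negative_shifted = 15, 9    # negative C DIF items
positive_total, positive_shifted = 3, 3     # positive C DIF items

print(percent(shifted_items, c_dif_items))         # overall success rate
print(percent(negative_shifted, negative_total))   # negative C DIF
print(percent(positive_shifted, positive_total))   # positive C DIF
```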

A closer look at the six items that did not shift from 
C DIF reveals that, in two cases, the revisions did indeed 
eliminate the C DIF for the group originally targeted, but 
a different group ended up with C DIF (Table 2, item 12,
and Table 3, item 23). In another case, the revision suc- 
cessfully eliminated the C DIF but the R-Biserial ended 
up below .30, which meant the revised item could not be 
included in the pool (Table 4, item 25). In a fourth case, 
the MH value shifted dramatically in the intended direc- 
tion but, as discussed above, the revised item became very 
easy for the total population and was still classified "C" 
using the MH delta metric (Table 8, item 12). Thus, in 
only two of the total of 23 items (Table 2, item 11, and 
Table 4, item 19) did no appreciable reduction in DIF 
occur for the targeted group(s) after the revisions were
made. 

It must be mentioned, however, that 6 of the 12 items 
for which C DIF was successfully eliminated also became 
substantially easier for the total population, i.e., percent 
correct increased by 25 percent or more (Table 2, items 
21 and 24; Table 3, item 19; Table 4, item 22; Table 6, 
items 14 and 18). Another 2 of the 12 items shifted con- 
tent classification as a result of revisions to the stems 
(Table 4, items 17 and 18). So in 8 out of 12 cases, the 
successfully revised items were significantly changed from 
the original versions either in content or statistics. For 
SAT-V items, elimination of C DIF often seemed to 
change some basic characteristic(s) of the item, yet the 
underlying reasoning skill being tested remained essen- 
tially the same in most cases. Assuming that item pools 
are large enough to allow assemblers to continue to meet 
test specifications, such shifts in content or statistics seem 
less important than the fact that the items no longer show 
elevated levels of DIF and thus can be considered for in- 
clusion in operational forms of the test. 

Factors Related to Differential Item 
Functioning 

Because only a limited number of items were pretested for 
each of the seven factors studied, and because the revi- 
sions of some of the items changed the degree of difficulty 
of the item considerably, conclusions about the relation- 
ship between the DIF factors studied and the observed 
DIF values must be made with caution. 

Effects of Science Terminology 

Technical (specialized) science material and substantive 
contexts drawn from science have been found to affect 



negatively the performance of female examinees on the 
SAT-V (Lawrence, Curley, and McHale 1988; Lawrence 
and Curley 1989; Scheuneman and Gerritz 1990). A look 
at Table 2 reveals that C DIF for females was eliminated 
after the revisions were made in three of the four science 
items included in this study; the other item (11) showed 
some reduction in the MH value in Form B (-1.42), but 
the further revisions in Forms C and D showed a return 
of negative C DIF for females. In item 12 the change in 
the key from "indigenous" to "native" eliminated the C 
DIF for females but introduced C DIF for Asian Ameri- 
can examinees; the item also became much easier overall 
(88% correct) because of the revision. Items 21 and 24 
became markedly easier, too, after the revisions were 
made. Note in item 21 that females were attracted
differentially whenever "fish:amphibian" was used as a wrong
answer choice (Forms A and D), but the version in Form 
C worked very well with "frog:amphibian" as the key. 
In item 24, females differentially omitted the item when 
"vortex" was in the stem but not when "whirlpool" was 
in the stem. 

Effects of Industrial Arts Terminology 

The revisions made in both of the items shown in Table 
3 significantly lowered the elevated levels of negative DIF 
for female examinees but also made both of the items 
considerably easier for the total population. In item 
19, after changing "rivet:metal" to "pin:cloth," the
differential percentage of matched females who omitted the
item was reduced from 16 to zero. Also in item 19, the 
level of negative DIF for Hispanic and black examinees 
(as well as for females) was greatly reduced. In item 23, 
high levels of DIF against matched females were reduced 
only when both the stem and key were revised but, with 
the introduction of the new stem ("prongs:pitchfork"), 
larger amounts of negative DIF for Hispanic and black
examinees appeared (Form A). Then, with the addition 
of the new key ("point:spear"), negative C DIF for His- 
panic and Asian American examinees was observed 
(Form B). 

Effects of Military Terminology 

Table 4 reveals that negative C DIF for females was
successfully eliminated in four out of the six items when the
analogy stems "convoy:ships," "detonate:explosion,"
"mutiny:captain," and "cockpit:pilot" were changed
(respectively) to "troupe:dancers," "provoke:reaction,"
"rebellion:authority," and "stall:vendor." The revisions
in the stems of items 17 and 18 also changed the
content categories from "Practical Affairs/Social Sciences"
to "Humanities/Human Relations," but the overall
difficulty levels remained approximately the same. The
revision of the stem of item 19 did not eliminate the nega-
TABLE 5



Effects of Contexts Portraying Aggression or Conflict 









Item
No.

(Form)
P
R-Bis

MH D-DIF
ETS DIF Category
DSTD-P%

F   H   B   A   Item Text

(Form)
P
R-Bis

MH D-DIF
ETS DIF Category
DSTD-P%

F   H   B   A   Item Text


13.




-1.35 


0.25 


0.12 


0.61 


These ominous developments




(D) 


-0.99 


0.42 


0.96 


0.74 


These inauspicious developments 




.41 


TJ 
D 


A 

A 


A 
rt 


A 

A 


suggest that the political conflict 




.27 


A 


A 


A 


A 


suggest that the political conflict 




.76 










in that country has entered a new 




.46 










in that country has entered a new 












and more ----- phase.














and more phase. 






z 


4 


0 


— 1 


(A) moderate 






-2 


c 
—.1 


-/ 


-2 


(A) moderate 






0 


1 


3 


-1 


(B) legitimate 






— 1 


— Z 


i 
i 


— j 


(B) legitimate 






7 


-3 


-3 


-i 


(C) productive 






7 


3 

J 




_3 


(C) productive 






-V 


1 


1 


A 

4 


*(D) perilous






_7 


3 


4 


g 


*(D) perilous 






— 1 


1 


0 


U 


(E) inconsequential






o 


o 




1 


(E) inconsequential 






1 


—4 


-0 


1 


(OMITS) 






4 




_3 


1 


(OMITS)


13.




-0.96 


-0.25 


-0.35 


-0.05 


These ominous developments




(B) 


0.16 


-0.38 


-0.05 


0.33 


These auspicious developments 






A 


A 


A 


A 


suggest that the social climate in




.53 


A 


A 


A 


A 


suggest that the political climate 




.76 










that country has entered a new 




.27 










in that country has entered a new 














and more phase. 














and more phase. 






u 


u 


< 

0 


i 
— z 


(A) moderate 






_4 


-1 


-1 


-3 


(A) hazardous 






_ 1 




1 


2 


(B) legitimate 






-1 


4 


o 


2 


(B) illegitimate 






5 


- 


•7 


u 


(C) productive 






(j 


o 


4 


1 


(C) unproductive 






c 
-J 


-1 


-1 


u 


*(D) perilous 






2 


_4 


_1 


3 


*(D) promising 






t\ 
V 


i 

— i 


A 
V 


1 
i 


(E) inconsequential






1 


2 


0 


-1 


(E) inconsequential 






i 
1 


~' 


-3 


o 


(OMITS) 






3 


_1 


_2 


-2 


(OMITS) 


15. 


(B) 


— L.o4 


-0.30 


-0.33 


0.44


Heretofore for his emphasis 




(A) 


-2.33 


0.8? 


0.15 


0.36 


In the past the general had been 












on defensive strategies, the




.17 


C 


A 


A 


A 


for his emphasis on defensive 




.15 


c 


A 


A 


A 


general was when doctrines 




.61 










strategies, but he was when 




.55 










emphasizing aggression were 














doctrines emphasizing aggression 














discredited. 






-3 


2 


-2 


-1 


were discredited. 






-i 


2 


-1 


0 


(A) criticized. .discharged 






1 


2 


0 


0 


(A) criticized. .discharged 






i 


_3 


0 


4 


(B) parodied. .ostracized 






5 


-9 


2 


-5 


(B) parodied. .ostracized 






7 


2 


2 


-6 


(C) supported. .disappointed 






-11 


3 


T 


2 


(C) supported. .disappointed 






-7 


-1 


0 


2 


*(D) spurned. .vindicated 






8 


3 


i 


2 


*(D) spurned. .vindicated 






1 


0 


2 


_2 


(E) praised. .disregarded 






0 


0 


-i 


2 


(E) praised. .disregarded 






0 


1 


-3 


1 


(OMITS)














(OMITS) 


15. 


(D)


-1.56 


-0.28 


-0.21 


0.04 


Heretofore for his emphasis 




(C) 


-1.27 


0.26 


0.25 


-0.04 


Heretofore for her emphasis 




.18 


C 


A 


A 


A 


on defensive strategies, the 




.14 


B 


A 


A 


A 


on conservation, the economist 




.51 










general was when doctrines 




.59 










was when doctrines 














emphasizing aggression were 














emphasizing consumption were 














discredited. 














discredited. 






-1 


-1 


0 


1 


(A) criticized. .discharged 






_2 


5 


-1 


-3 


(A) criticized. .discharged 






1 


1 


0 


-1 


(B) parodied. .ostracized 






0 


-1 


3 


3 


(B) parodied. .ostracized 






9 


4 


7 


-4 


(C) supported. .disappointed 








1 


2 


0 


(C) supported. .disappointed 






-8 


-2 


0 


0 


*(D) chastised..vindicated 






-5 


1 


1 


0 


*(D) spurned. .vindicated 






0 


3 


—i 


0 


(E) praised. .disregarded 






2 


-1 


-2 


-6 


(E) praised. .disregarded 






-1 


-5 


-4 


4 


(OMITS) 






-3 


-5 


-3 


1 6 


(OMITS) 



MH D-DIF: Mantel-Haenszel Index of Delta Differences (focal minus reference) 

ETS DIF Category: A represents negligible DIF, B represents slight to moderate DIF, and C represents moderate to large DIF. 

DSTD-P%: Standardization Index of Proportion Correct Differences (focal minus reference)

F: matched female/male comparison H: matched Hispanic/white comparison

B: matched black/white comparison A: matched Asian American/white comparison

*Indicates correct answer.

Item revisions are indicated by boldface.



tive C DIF for female examinees, although a further revi- 
sion of the key ("arrow:quiver") might have helped to 
produce the intended effect. 

Item 20 reveals that changing only a distractor
("war:general" to "recipe:chef" in Form D) did not
eliminate the negative C DIF for females; rather, "mutiny"
was the term that females seemed less familiar with than
the matched groups of males. Item 22 is interesting in that
the terms in the key, "turret:gunner," seemed more
differentially difficult for females than the terms in the stem,
"cockpit:pilot" (see the version in Form C). With the



change to a new key in Form C ("booth:toll collector"), 
however, negative C DIF was present for all three minor- 
ity groups. The C DIF was eliminated entirely in Form D 
when the stem was revised as well to "stall:vendor." 
Item 25 was discussed earlier; the R-Biserial of .22 in 
Form B makes this version unacceptable for use in the 
pool even though the MH value was reduced. 

Contexts Portraying Aggression/Conflict 

Items suggesting aggression or conflict as well as items 
with a strongly negative, possibly upsetting tone have 

TABLE 6



Effects of Special Interest Terminology 



Item 

No. 


(Form) 
P 


MH D-DIF
ETS DIF Category 
DSTD-P% 






(Form) 

P 
R-Bis 


MH D-DIF
ETS DIF Category 
DSTD-P% 




R-Bis 


F 


H 


B 


A 


Item Text 




F 


H 


B 


A 


Item Text 


14. 


(C) 
.32 
.66 


2.15 
C 

14 
-4 

0 
-3 
-2 
-6 


-0.41 
A 

-2 
2 
9 
0 
-1 
-8 


-0.86 
A 

-4 
-1 
5 
3 
-3 
-1 


0.31 
A 

2 
-3 
0 
1 

-1 
1 


It is its , its rhythmic energy 

and expansive vivacity, that 
makes jazz so typically American. 
*(A) verve 

(B) paucity

(C) formality 

(D) quiescence 

(E) derivativeness
(OMITS) 




(D) 
.59 
.64 


1.19 
B 

9 
-2 

0 
-1 
-2 
-3 


0.61 
A 

4 
-3 
-1 

3 
-1 
-2 


0.25 
A 

2 
-1 
3 
1 
-1 
-3 


0.98 
A 

7 
-2 
1 

-3 
-2 
-2 


It is its , its rhythmic energy 

and expansive vivacity, that 
makes jazz so typically American. 
*(A) vitality 

(B) paucity 

(C) formality 

(D) quiescence 

(E) derivativeness 
(OMITS) 


24. 


(A) 
.30 
.45 


1.53 
C 
0 
-4 
12 
-2 
3 

-10 


1.01 
B 
3 

-1 
7 

-1 
0 

-8 


0.97 
A 
0 
2 
7 
1 
1 

-10 


-0.08 
A 
1 
1 
0 
0 
-1 
-1 


DOTE:FONDNESS:: 

(A) improvise:practice

(B) attract:repulsion 
*(C) pamper:indulgence 

(D) unnerve:composure

(E) supervise:regulation 
(OMITS) 




(B) 
.34 
.50 


1.06 
B 
1 

-2 
8 

-1 
0 

-6 


1.05 
B 
3 
-2 
8 
1 

-2 
-7 


0.62 
A 
0 
1 
5 
0 
2 

-6 


0.53 
A 
-1 
-2 
4 
4 
1 

-8 


ABHOR:DISTASTE:: 

(A) improvise:practice 

(B) attract:repulsion 
*(C) pamper:indulgence 

(D) unnerve:composure 

(E) supervise:regulation
(OMITS) 


18. 


(D) 
.47 
.29 


0.81 
A 
0 
8 
0 
-2 
1 

-6 


-0.88 
A 
1 

-8 
3 
4 
2 

-2 


2.08 
C 
1 
21 
-3 
2 
-2 
-19 


0.36 
A 
1 
3 
-1 
-1 
1 

-2 


PLAIT:HAIR::

(A) knead:bread 
*(B) weave:yarn 

(C) cut:cloth 

(D) fold:paper 

(E) frame:picture 
(OMITS) 




(C) 
.80 
.29 


-0.28 
A 
2 

-2 
0 

-1 
1 
0 


-0.72 
A 
-1 
-6 
2 
1 
4 
0 


-0.39 
A 
0 

-4 
1 

-1 
3 
1 


-0.34 
A 
1 

-2 
1 
0 
0 
1 


BRAID:HAIR:: 

(A) knead:bread 
*(B) weave:yarn 

(C) cut:cloth 

(D) fold:paper 

(E) frame:picture 
(OMITS) 



MH D-DIF: Mantel-Haenszel Index of Delta Differences (focal minus reference) 

ETS DIF Category: A represents negligible DIF, B represents slight to moderate DIF, and C represents moderate to large DIF. 

DSTD-P%: Standardization Index of Proportion Correct Differences (focal minus reference) 

F: matched female/male comparison H: matched Hispanic/white comparison 

B: matched black/white comparison A: matched Asian American/white comparison 

*Indicates correct answer.

Item revisions are indicated by boldface. 

TABLE 7 



Effects of Cognates 



Item 
No. 



(Form) 

P 
R-Bis 



14. 



(A) 
.28 
.60 



MH D-DIF
ETS DIF Category
DSTD-P%



-1.33 
B 



H 



-1.19 
B 



B 



-0.52 
A 



-0.78 
A 



Item Text 



Although scholars often wrestle 

with how to the impact of 

various influences in an author's 
life on that author's work, they 

sometimes neglect to the 

effect of such forces on their own 
writings. 

(A) quantify..expunge

(B) surmise..censor

(C) evaluate..amplify
*(D) gauge.. scrutinize 

(E) disguise. .amend 
(OMITS) 



(Form) 

P 
R-Bis 



(B) 
.45 
.48 



MH D-DIF
ETS DIF Category 
DSTD-P% 



-0.18 



H 



-0.20 
A 



0.39 
A 



-0.55 
A 



Item Text 



Although scholars often wrestle 

with how to the impact of 

various influences in an author's 
life on that author's work, they

sometimes neglect to the 

effect of such forces on their own 
writings. 

(A) quantify..expunge 

(B) surmise..censor

(C) evaluate. .amplify 
*(D) measure.. scrutinize 

(E) disguise. .amend 
(OMITS) 



25. 



(D) 
.14 
.37 



0.14 
A 
1 
0 
0 
1 
0 
-2 



-0.09 
A 
3 
8 
-2 
5 
0 

-13 



1.01 
B 
1 
3 
0 
1 
4 

-10 



0.39 
A 
0 
0 
3 
-2 
2 



DULCET:TONE:: 

(A) pleased:smile

(B) monotonous:voice

(C) insatiable:appetite

(D) sarcastic:wit
*(E) delicious:taste

(OMITS) 



(C) 
.26 
.52 



0.01 
A 
0 
5 

-1 
0 
0 

-3 



-0.21 
A 
-1 
6 
4 
4 
-1 
-12 



0.03 
A 
2 
5 
2 

-1 
0 

-8 



0.72 
A 
-1 
-1 
0 
-2 
6 
-2 



EUPHONIOUS:TONE:: 

(A) pleased:smile 

(B) monotonous:voice

(C) insatiable:appetite

(D) sarcastic:wit
*(E) delicious:taste 

(OMITS) 



MH D-DIF: Mantel-Haenszel Index of Delta Differences (focal minus reference) 

ETS DIF Category: A represents negligible DIF, B represents slight to moderate DIF, and C represents moderate to large DIF.

DSTD-P%: Standardization Index of Proportion Correct Differences (focal minus reference)

F: matched female/male comparison H: matched Hispanic/white comparison

B: matched black/white comparison A: matched Asian American/white comparison

*Indicates correct answer.

Item revisions are indicated by boldface.
16.


(D) 


-1.43 


-1.59 


-0.90 


-1.39 


SHORE:LAKE:: 




(C) 


-1.22 


-0.91 


-1.23 


-0.54 


SHORE:LAKE:: 




.94 


B 


B 


A 


A 






.82 


B 


A 


B 
-3


-6 


-4 


-3 


*(A) bank:river 




.50 


-7 


-6 


-11 


-2 


*(A) frame:picture






1 


0 


1 


-> 
L. 


(B) floor:ocean 






2 


-1 


5 


2 


(B) floor:ocean 






1 


4 


0 


1 


(C) wave:coast 






2 


0 


0 


1 


(C) wave:coast






0 


1 


0 


0 


(D) height:tower 






1 


3 


1 


1 


(D) height:tower 






1 


2 


2 


0 


(E) current:water






2 


4 


4 


-1 


(E) current:water






0 


0 


1 


0 


(OMITS) 






1 


0 


2 


0 


(OMITS) 


17. 


(B) 


-0.32 


-2.27 


-2.06 


-2.37 


DYE:FABRIC::




(A) 


0.14 


-0.55 


-0.66 


-0.07 


DYE:FABRIC:: 




.86 


A 


C 


C 


C 






.80 


A 


A 


A 


A 






.51 


0 


3 


3 


3 
.40


-1 


4 


-1 
3
1


-4 


-7 


0 
0


2 


2 
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(D) fuel:engine 






0 


3 


3 


1 


(D) fuel:engine






1 


7 


7 


1 


(E) ink:pen 






3 


-2 


3 


-3 


(E) ink:pen 






0 


1 


1 


1 


(OMITS) 






0 


0 


0 


0 


(OMITS) 



MH D-DIF: Mantel-Haenszel Index of Delta Differences (focal minus reference) 

ETS DIF Category: A represents negligible DIF, B represents slight to moderate DIF, and C represents moderate to large DIF.

DSTD-P%: Standardization Index of Proportion Correct Differences (focal minus reference)

F: matched female/male comparison H: matched Hispanic/white comparison 

B: matched black/white comparison A: matched Asian American/white comparison 

* Indicates correct answer. 

Item revisions are indicated by boldface. 



been postulated to be related to negative DIF for females 
(Wendler and Carlton 1987). Two "aggression/conflict" 
items with four versions each are presented in Table 5. 
Item 13 in Form C was the original pretested item, but as
indicated before, this item was no longer classified as C
DIF in this investigation. Nevertheless, the three revisions
did perform as expected, with the version in Form B
showing the least amount of DIF. Unfortunately, the
R-Biserial in Form B is .27, slightly below the acceptable
level for items (such as this version) of middle difficulty.
Item 15, a very difficult sentence completion, remained
difficult after each revision (14 to 18 percent correct). The
version in Form C changed the context of the sentence
from war to economics and did shift the category from
negative C DIF to negative B DIF for female examinees, but
the change in DSTD value from the original version to
that in Form C was only 2 percent, a negligible difference. 
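
As an illustration of how the two indices reported in these tables are computed, the following minimal Python sketch applies the Mantel-Haenszel common odds ratio and the standardization formula to made-up counts. The counts, the three matched score levels, and the function names are hypothetical and are not taken from this study; the -2.35 scaling to the delta metric follows Holland and Thayer (1988).

```python
import math

# Hypothetical counts, not data from the study. Each tuple is one matched
# score level: (ref_right, ref_wrong, focal_right, focal_wrong).
strata = [
    (40, 60, 15, 35),
    (70, 30, 30, 20),
    (90, 10, 42, 8),
]


def mh_d_dif(strata):
    """MH D-DIF = -2.35 * ln(alpha_MH), where alpha_MH is the common odds
    ratio across score levels. Negative values mean the item is relatively
    harder for the focal group (focal minus reference)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return -2.35 * math.log(num / den)


def dstd_p(strata):
    """Standardized difference in proportion correct (focal minus
    reference), weighted by the number of focal-group examinees at each
    score level."""
    weighted = sum(
        (c + d) * (c / (c + d) - a / (a + b)) for a, b, c, d in strata
    )
    return weighted / sum(c + d for _, _, c, d in strata)


print(f"MH D-DIF: {mh_d_dif(strata):.2f}")
print(f"DSTD-P%:  {100 * dstd_p(strata):.1f}")
```

Both functions return negative values for these counts because the focal group answers correctly less often than the matched reference group at every score level.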

Special Interest Terminology 

Terminology of special interest or familiarity to a particu- 
lar group (perhaps due to greater exposure or retention 
of it by that group) has been hypothesized to affect posi- 
tively the performance of that group when compared to 
the performance of a group without this special interest 
(Schmitt 1985, 1988; Schmitt and Bleistein 1987; 
Schmitt, Curley, Bleistein, and Dorans 1988; Bleistein, 
Schmitt, and Curley 1990).

TABLE 9

Mantel-Haenszel Values (MH D-DIF) and ETS DIF Categories for Selected SAT-Verbal Items Reprinted in this Study Identically
to the Initial Pretest

Form and Item     Table    Focal             MH D-DIF and ETS DIF     MH D-DIF and ETS DIF
Number in this    Number   Group             Category in this Study   Category from Prior
Study                                                                 Pretesting

Form C, No. 13    5        Females           -1.35 (category B)       -1.80 (category C)
Form A, No. 14    7        Hispanics         -1.19 (category B)       -2.00 (category C)
Form A, No. 16    8        Hispanics         -1.43 (category B)       -1.78 (category C)
Form D, No. 16    8        Hispanics         -1.59 (category B)       -2.04 (category C)
                           Asian Americans   -1.39 (category A)       -1.91 (category C)
Form D, No. 25*   3        Blacks            -0.83 (category A)       -1.59 (category C)
Form D, No. 25    7        Hispanics         -0.09 (category A)       +1.71 (category C)

* This item was also C DIF for females at initial pretesting and remained C DIF for females when reprinted for this study.





Table 6 presents three item-pairs in which terms 
deemed of special interest were varied. Items 14 and 24 
contain the terms "verve" and "dote," which were con- 
sidered of possible special interest to female examinees, 
while item 18 contains the term "plait," which was 
deemed of special interest to the black group. These judg- 
ments about special interests were made after analyzing 
the DIF data from the initial pretesting. In items 14 and 
18, only the term considered of special interest was 
changed in the second version; in item 24, only the two 
stem terms were changed. In all three items, the revised 
version (in which the term of special interest was replaced 
by a hypothetically neutral synonym) was no longer dif- 
ferentially easier for the focal group. No extreme MH C 
DIF is evident and the DSTD index also indicates a re- 
duction in the expected direction. It is important to note, 
nevertheless, that the revised versions became notably 
easier (+25 percent or more) for items 14 and 18. 

Cognates 

Words that have the same meaning in English as do close 
approximations of the words in Spanish have been pos- 
tulated to affect positively the performance of Hispanic 
examinees when compared to white examinees (Schmitt 
1985, 1988; Schmitt, Curley, Bleistein, and Dorans 1988; 
Schmitt and Dorans 1991). 

Table 7 presents two item-pairs in which words con- 
sidered cognates or noncognates were replaced with syn- 
onyms. In item 14, the word "gauge" (in the key) is a 
noncognate that was replaced in Form B with the word 
"measure," which is a cognate. Although the reprinted 
version of this item in Form A is not classified as C DIF 
as it was when initially pretested, the revision (Form B) 
does show a reduction in the level of both MH and DSTD 
DIF for Hispanic examinees (and for females, perhaps 
because the word "gauge" is used in science and industrial
arts). Item 25 was discussed earlier in this paper; it
initially showed an elevated C DIF in favor of Hispanics 
but, when reprinted for this study, no appreciable effect 
of the change in the stem from "dulcet" to "euphonious" 
appeared. 

Homographs 

Words spelled and pronounced alike but having mul- 
tiple meanings have been postulated to be sources of 
vocabulary confusion that could negatively affect the 
performance of some focal group examinees when compared
to the performance of comparable reference group
examinees (Schmitt 1985, 1988; Schmitt and Bleistein
1987; Schmitt, Curley, Bleistein, and Dorans 1988;
Bleistein, Schmitt, and Curley 1990; Schmitt and Dorans
1991). 

Four item-pairs in which homographs were replaced
by comparable terms with single meanings are presented
in Table 8. Two of the items (item 16 in Form A and item
16 in Form D) were very easy and, when pretested again,
were not classified as extreme C DIF items. Nevertheless,
when the homograph was replaced in the other version
of these items, a small (and insignificant) reduction in the
negative MH values was observed. For item 12, the extreme
negative DIF observed for female examinees was
reduced when the key "tapping" was changed to "utilizing,"
but the overall difficulty of the item was also reduced
considerably. The revised version of the item is still
classified as negative C DIF for females but, as discussed
earlier, this classification may be an artifact of the MH
delta metric. Item 17 behaved almost exactly as expected.
The overall item difficulty and discrimination did not
change much but, after substituting the word "paint" for
"stain" in the key and "stain" for "paint" in the (A) option,
the negative DIF observed for all three minority focal
groups was reduced considerably.

Conclusions

Several diverse conclusions can be drawn from the data 
analyzed in this investigation. First, it would appear that 
revising and re-pretesting SAT-V items to eliminate C DIF 
is feasible and likely to succeed often enough to make it 
practical to do so, particularly when prior research on 
hypothesized DIF factors and/or factors based on ob- 
served occurrences of extreme DIF inform the revisions. 
Changing one or both words in the stems of analogies 
seemed to be related to the largest and most consistently 
predictable changes in DIF data, although in many cases 
such stem revisions also strongly influenced the overall 
difficulty of the items. Vocabulary-oriented revisions in 
the keys of sentence completions also proved effective. 
Item discrimination almost always remained at accept- 
able levels for the revised versions of both types of ques- 
tions. Changes only to wrong answer choices (distractors) 
rarely had strong influence on the DIF data. With the 
above guidelines in mind — and assuming it is desirable 
or necessary to do so — it seems appropriate to recom- 
mend further such revisions of C DIF items to reduce or 
eliminate differential difficulty as long as such revised 
items are re-pretested and reanalyzed for DIF. 

Second, the particular terminology used in the stems 
and keys of analogies and sentence completions seems to 
be a significant source of elevated levels of DIF on the 
SAT-V. This hypothesis is supported by the fact that, af- 
ter the revision of one or two words in most of the items 
studied, DIF was reduced to acceptable levels. If particu- 
lar terminology (distinct from underlying analogical rea- 
soning skills or the ability to follow the logic of sentences) 
is often related to elevated levels of DIF, then evaluation 
of the construct relevance or irrelevance of individual C 
DIF items would seem to be appropriate and should be 
conducted as part of the routine development of tests 
such as the SAT-V. 

Third, to the extent possible, larger sample sizes for 
focal groups (particularly minority) would seem to be a 
desirable goal, since the stability of ETS DIF categories is 
reduced when the sample sizes are small. More than 20 
percent of the items studied in this investigation were 
classified as "C" when first pretested but then as "B" or 
"A" when reprinted identically for comparable popula- 
tions. (This percentage is even greater if one considers 
only the DIF data for the minority groups, for which 
sample sizes are relatively small.) Such variations are 
problematic not only because they make it difficult to 
study the effects of systematic revisions of items but also 
because, more importantly, they undermine the effort to 
screen out items with elevated levels of DIF from operational
test forms. Without stable classifications, test developers
and statisticians cannot be certain which pretest
items to review for construct-irrelevant sources of DIF 
and which pretest items not to review. 
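
The dependence of the categories on sample size can be sketched in a few lines of Python. The thresholds used here (1.0 and 1.5 delta units, with significance tests at the 5 percent level) are the rules commonly attributed to ETS and are assumed for this illustration; this report itself gives only the verbal labels for the categories. The standard errors in the example are hypothetical and simply stand in for large, moderate, and small focal-group samples.

```python
def ets_dif_category(mh_d_dif, std_error):
    """Classify an item as A, B, or C from its MH D-DIF value and the
    standard error of that value (assumed thresholds, see lead-in)."""
    size = abs(mh_d_dif)
    # A: smaller than 1 delta unit, or not significantly different from 0
    # (two-sided test at the 5 percent level).
    if size < 1.0 or size / std_error < 1.96:
        return "A"
    # C: at least 1.5 delta units AND significantly greater than 1
    # (one-sided test at the 5 percent level).
    if size >= 1.5 and (size - 1.0) / std_error > 1.645:
        return "C"
    # B: everything in between.
    return "B"


# The same MH D-DIF value lands in a different category as the standard
# error grows, which is what happens when the focal group is small.
for se in (0.25, 0.60, 1.00):
    print(f"MH D-DIF = -1.78, SE = {se:.2f}: {ets_dif_category(-1.78, se)}")
```

Under these assumed rules, an identical MH D-DIF value can move from "C" to "B" to "A" purely because the standard error increases, which is consistent with the instability described above for the smaller minority samples.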

Fourth, for classifying the level of DIF (i.e., the "A," 
"B," and "C" categories), a combination of the Standard- 
ization p metric and the Mantel-Haenszel delta metric for 
very easy and very difficult items seems logical given that 
the MH (delta-metric) statistic at the extremes of the dif- 
ficulty continuum has larger standard errors than does 
the DSTD (p-metric) statistic. Because "the delta metric 
is unbounded at the extremes..., differences for easy and 
hard items are played up" (Dorans and Holland 1992, 
p. 27). The fact that DIF data such as those found for item
12 (Form C) in Table 8 and for item 25 (Form A) in Table 
4 yield classifications of "C" is unfortunate. These items 
do not reveal "moderate to large" amounts of DIF; 
rather, they are merely very easy or very difficult items 
for which the MH delta metric is not as appropriate an 
indicator of DIF as is the DSTD p metric. 
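
A short numerical sketch may make the point concrete. It holds the difference in proportion correct fixed at 3 percentage points and converts it to the delta scale through the log odds ratio, using the same -2.35 multiplier that defines MH D-DIF. Treating each comparison as a single matched score level is a simplifying assumption, and the proportions are hypothetical rather than taken from the tables.

```python
import math


def delta_gap(p_reference, p_focal):
    """Delta-scale difference implied by two proportions correct, using
    the log-odds form of the MH statistic: -2.35 * ln(odds ratio)."""
    odds_reference = p_reference / (1 - p_reference)
    odds_focal = p_focal / (1 - p_focal)
    return -2.35 * math.log(odds_reference / odds_focal)


# The p-metric gap is the same 3 points at every difficulty level, but the
# delta-metric gap grows as the item becomes very easy.
for p_ref in (0.50, 0.80, 0.94, 0.97):
    p_focal = p_ref - 0.03
    print(f"{p_ref:.2f} vs {p_focal:.2f}: {delta_gap(p_ref, p_focal):.2f}")
```

For a middle-difficulty item the 3-point gap corresponds to well under one delta unit, whereas for an item that 97 percent of the reference group answers correctly the same gap exceeds 1.5 delta units in absolute value.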

A final thought relates to the factors (derived from 
prior DIF research and/or observation of pretested items 
with extreme DIF) that were used in selecting the items 
for this investigation. Because the primary purpose of the 
study was to evaluate whether or not revisions to 
C DIF items could be made efficaciously, evaluation of 
the various factors themselves was necessarily ancillary: 
not many items were studied for most of the individual 
hypotheses. Yet the authors, in conducting this investi- 
gation and attempting to draw conclusions from the data, 
had to try to determine for themselves the source(s) of the 
observed C DIF (beyond issues such as sample size, diffi- 
culty level, and the metric used). If, as concluded earlier 
in this section, the particular terminology used in the 
stems and keys of SAT-V questions is related to elevated 
levels of DIF, then that DIF is likely also related to read- 
ing and other means of vocabulary acquisition, which are 
part of the construct of reasoning tests such as the SAT- 
V that measure developed verbal abilities. 

It is important that an incorrect "message" not be 
transmitted to examinees, teachers, and others concern- 
ing the application of DIF statistics to the test develop- 
ment process: the deletion of entire categories of items 
that happen to include some specialized terminology 
might erroneously suggest that breadth and depth of vo- 
cabulary are not important. Students should be encour- 
aged to continue to strive for breadth of coverage in their 
reading and course work. 

Future exploration of the construct relevance of factors
related to DIF could address questions such as: Are
individual C DIF items that include particular terminology
such as that evaluated in this study relevant to the
construct measured by tests of developed verbal ability
such as the SAT-V? One way to try to answer such a
question empirically would be to determine how particular
items or categories of items associated with elevated
levels of DIF are related to the predictive validity of such
tests for all groups of examinees. Such an exploration
could be a significant contribution to future research in
this area.



References 

Bleistein, C. A., A. P. Schmitt, and W. E. Curley. 1990. "Fac- 
tors Hypothesized to Affect the Performance of Black Ex- 
aminees on SAT-Verbal Analogy Items." Paper presented 
at the annual meeting of the National Council on Measure- 
ment in Education, April, Boston, Mass. 

Bleistein, C. A., and D. Wright. 1986. "Assessment of Unex- 
pected Differential Difficulty for Asian-American Candi- 
dates on the SAT." In Differential Item Functioning on the 
Scholastic Aptitude Test (RM-87-01), ed. A. P. Schmitt and
N. Dorans. Princeton, N.J.: Educational Testing Service. 

Dorans, N. J. 1989. "Two New Approaches to Assessing Dif- 
ferential Item Functioning: Standardization and the Man- 
tel-Haenszel Method." Applied Measurement in Education 
2: 217-33. 

Dorans, N. J., and P. W. Holland. 1992. DIF Detection and 
Description: Mantel-Haenszel and Standardization
(RR-92-10). Princeton, N.J.: Educational Testing Service.

Dorans, N. J., and E. Kulick. 1983. Assessing Unexpected Dif- 
ferential Item Performance of Female Candidates on SAT 
and TSWE Forms Administered in December 1977: An 
Application of the Standardization Approach (RR-83-9). 
Princeton, N.J.: Educational Testing Service. 

Dorans, N. J., and E. Kulick. 1986. "Demonstrating the Utility 
of the Standardization Approach to Assessing Unexpected 
Differential Item Performance on the Scholastic Aptitude 
Test." Journal of Educational Measurement 23: 355-68. 

Dorans, N. J., A. P. Schmitt, and C. A. Bleistein. 1992. "The 
Standardization Approach to Assessing Comprehensive 
Differential Item Functioning." Journal of Educational 
Measurement 29:309-19. 

Dorans, N. J., A. P. Schmitt, and W. E. Curley. 1988. "Differ- 
ential Speededness: Some Items Have DIF Because of 
Where They Are, Not What They Are." Paper presented 
at the annual meeting of the National Council on Measure- 
ment in Education, March, New Orleans, La. 

Holland, P. W., and D. T. Thayer. 1988. "Differential Item 
Performance and the Mantel-Haenszel Procedure." In Test 
Validity, ed. H. Wainer and H. I. Braun, pp. 129-45. 
Hillsdale, N.J.: Erlbaum. 

Lawrence, I. M., and W. E. Curley. 1989. Differential Item 
Functioning for Males and Females on SAT-Verbal Read- 
ing Subscore Items: Follow-up Study (RR-89-22). 
Princeton, N.J.: Educational Testing Service. 



Lawrence, I. M., W. E. Curley, and F. J. McHale. 1988. Differ- 
ential Functioning of SAT-Verbal Reading Subscore Items 
for Male and Female Examinees (RR-88-10). Princeton, 
N.J.: Educational Testing Service. 

Mantel, N., and W. M. Haenszel. 1959. "Statistical Aspects of
the Analysis of Data from Retrospective Studies of Disease."
Journal of the National Cancer Institute 22: 719-48.

Petersen, N. 1988. "DIF Procedures for Use in Statistical Analy- 
sis." Unpublished memorandum issued September 14, 
1988. 

Rogers, H. J., and E. Kulick. 1987. "An Investigation of
Unexpected Differences in Item Performance between Blacks
and Whites Taking the SAT." In Differential Item
Functioning on the Scholastic Aptitude Test (RM-87-01), ed.
A. P. Schmitt and N. Dorans. Princeton, N.J.: Educational
Testing Service.

Scheuneman, J. D. 1987. "An Experimental Exploratory Study 
of Causes of Bias in Test Items." Journal of Educational 
Measurement 24: 97-118.

Scheuneman, J. D., and J. A. Briel. 1988. "Differential Effects 
of Selected Item Factors on the Performance of Hispanic 
and White Examinees." Paper presented at the annual 
meeting of the American Educational Research Associa- 
tion, April, New Orleans, La. 

Scheuneman, J. D., and K. Gerritz. 1990. "Using Differential 
Item Functioning Procedures to Explore Sources of Item 
Difficulty and Group Performance Characteristics." Jour- 
nal of Educational Measurement 27: 109-31. 

Schmitt, A. P. 1985. Assessing Unexpected Differential Item 
Performance of Hispanic Candidates on SAT Form
3FSA08 and TSWE Form E47 (SR-85-169). Princeton, 
N.J.: Educational Testing Service. 

Schmitt, A. P. 1988. "Language and Cultural Characteristics 
that Explain Differential Item Functioning for Hispanic 
Examinees on the Scholastic Aptitude Test." Journal of 
Educational Measurement 25: 1-13. 

Schmitt, A. P., and C. A. Bleistein. 1987. Factors Affecting Dif- 
ferential Item Functioning for Black Examinees on Scho- 
lastic Aptitude Test Analogy Items (RR-87-23). Princeton, 
N.J.: Educational Testing Service. 

Schmitt, A. P., W. E. Curley, C. A. Bleistein, and N. J. Dorans. 
1988. "Experimental Evaluation of Language and Interest 
Factors Related to Differential Item Functioning for
Hispanic Examinees on the SAT-Verbal." Paper presented at
the annual meeting of the National Council on Measure- 
ment in Education, March, New Orleans, La. 

Schmitt, A. P., and N. J. Dorans. 1990. "Differential Item
Functioning for Minority Examinees on the SAT." Journal of
Educational Measurement 27: 67-81. 

Schmitt, A. P., and N. J. Dorans. 1991. "Factors Related to 
Differential Item Functioning for Hispanic Examinees on 
the Scholastic Aptitude Test." In Assessment and Access:
Hispanics in Higher Education, ed. G. D. Keller, J. R. 
Deneen, and R. J. Magallan, pp. 105-32. New York: 
SUNY Press. 

Wendler, C. L. W., and S. T. Carlton. 1987. "An Examination 
of SAT-Verbal Items for Differential Performance by 
Women and Men: An Exploratory Study." Paper presented 
at the annual meeting of the American Educational Re- 
search Association, April, Washington, D.C. 

Wright, D. 1987. "An Empirical Comparison of the Mantel- 
Haenszel and Standardization Methods of Detecting Dif- 
ferential Item Performance." In Differential Item
Functioning on the Scholastic Aptitude Test (RM-87-01), ed. A. P.
Schmitt and N. Dorans. Princeton, N.J.: Educational Test- 
ing Service. 

Zieky, M. 1991. "Using DIF Statistics in TD: Practical Issues."
Paper presented at the annual meeting of the National
Council on Measurement in Education, April, Chicago, Ill.



Appendix 



Summary of Hypotheses about DIF Relevant to the SAT-Verbal Items Selected for this Study



Description of Hypothesis*
(and References, if any) 



Technical/specialized science terminology may negatively 
affect the performance of females (Lawrence, Curley, and 
McHale 1988; Lawrence and Curley 1989; Scheuneman 
and Gerritz 1990) 



Total Number of 
Items Studied 



Table of Items
and Data 



Table 2 



Technical/specialized industrial arts terminology may 
negatively affect the performance of females (no references 
from research — based on empirical observation of SAT-V 
pretest results) 

Technical/specialized military terminology may negatively 
affect the performance of females (no references from 
research — based on empirical observation of SAT-V 
pretest results) 

Contexts portraying aggression or conflict may negatively 
affect the performance of females (Wendler and Carlton 
1987) 

Terminology of special interest or familiarity to a group 
may positively affect the performance of that group 
(Schmitt 1985, 1988; Schmitt and Bleistein 1987; Schmitt, 
Curley, Bleistein, and Dorans 1988; Bleistein, Schmitt, and 
Curley 1990) 

Cognates with Spanish may positively affect the performance 
of Hispanic examinees (Schmitt 1985, 1988; Schmitt, Curley, 
Bleistein, and Dorans 1988; Schmitt and Dorans 1991) 

Homographs may negatively affect the performance of 
Hispanic, black, and Asian American examinees (Schmitt 
1985, 1988; Schmitt and Bleistein 1987; Schmitt, Curley, 
Bleistein, and Dorans 1988; Bleistein, Schmitt, and Curley 
1990; Schmitt and Dorans 1991)

* Not all (or even most) SAT-Verbal items in these seven general categories consistently show elevated levels of DIF, b
pattern:, have been detected.